BinaryBERT: Pushing the Limit of BERT Quantization

TABLE 5.5  Quantization results of BinaryBERT on SQuAD and MNLI-m. #Bits is listed as the bit-width of Transformer weights, word embeddings, and activations (W-E-A); model sizes are in MB.

| Method       | #Bits      | Size (MB) | SQuAD-v1.1 | MNLI-m |
|--------------|------------|-----------|------------|--------|
| BERT-base    | full-prec. | 418       | 80.8/88.5  | 84.6   |
| DistilBERT   | full-prec. | 250       | 79.1/86.9  | 81.6   |
| LayerDrop-6L | full-prec. | 328       | -          | 82.9   |
| LayerDrop-3L | full-prec. | 224       | -          | 78.6   |
| TinyBERT-6L  | full-prec. | 55        | 79.7/87.5  | 82.8   |
| ALBERT-E128  | full-prec. | 45        | 82.3/89.3  | 81.6   |
| ALBERT-E768  | full-prec. | 120       | 81.5/88.6  | 82.0   |
| Quant-Noise  | PQ         | 38        | -          | 83.6   |
| Q-BERT       | 2/4-8-8    | 53        | 79.9/87.5  | 83.5   |
| Q-BERT       | 2/3-8-8    | 46        | 79.3/87.0  | 81.8   |
| Q-BERT       | 2-8-8      | 28        | 69.7/79.6  | 76.6   |
| GOBO         | 3-4-32     | 43        | -          | 83.7   |
| GOBO         | 2-2-32     | 28        | -          | 71.0   |
| TernaryBERT  | 2-2-8      | 28        | 79.9/87.4  | 83.5   |
| BinaryBERT   | 1-1-8      | 17        | 80.8/88.3  | 84.2   |
| BinaryBERT   | 1-1-4      | 17        | 79.3/87.2  | 83.9   |

Then, the prediction-layer distillation minimizes the soft cross-entropy (SCE) between the quantized student logits $\hat{y}$ and the teacher logits $y$, i.e.,

$$\ell_{\mathrm{pred}} = \mathrm{SCE}(\hat{y}, y). \tag{5.25}$$
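For concreteness, the SCE term can be written in a few lines of PyTorch. The sketch below is illustrative rather than taken from the paper's code; the function name and the optional temperature argument are assumptions, with a temperature of 1 matching Eq. (5.25).

```python
import torch.nn.functional as F

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between quantized-student and teacher logits.

    Illustrative sketch of Eq. (5.25); the temperature argument is an added
    convenience, with temperature=1 corresponding to the equation as stated.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy of the teacher distribution against the student's
    # log-probabilities, averaged over the batch.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```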

After splitting from the half-sized ternary model, the binary model inherits the ternary model's performance, but now on a full-width architecture. However, the original minimum of the ternary model may no longer be a minimum in the new loss landscape after splitting. Thus, the authors further proposed to fine-tune the binary model with prediction-layer distillation to search for a better solution.
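The property that makes this inheritance possible is that the two binary halves sum back to the ternary weight, so the full-width model computes the same function at initialization. The sketch below illustrates only this sum-preserving idea on a single weight tensor; the function name and the rule for zero-valued weights are simplified assumptions, not the exact ternary weight splitting construction used in the paper.

```python
import torch

def split_ternary(w_ternary, alpha):
    """Split a ternary tensor with values in {-alpha, 0, +alpha} into two
    binary tensors a, b with values in {-alpha/2, +alpha/2} such that
    a + b == w_ternary.  Simplified illustration, not the paper's exact rule.
    """
    half = alpha / 2.0
    sign = torch.sign(w_ternary)  # -1, 0, or +1 per entry
    # Non-zero entries are split evenly; zeros become a cancelling +/- pair.
    a = torch.where(sign == 0, torch.full_like(w_ternary, half), sign * half)
    b = torch.where(sign == 0, torch.full_like(w_ternary, -half), sign * half)
    assert torch.allclose(a + b, w_ternary)
    return a, b
```

Because the two halves reconstruct the ternary weights exactly, the widened binary model starts from the very solution the ternary model converged to, and fine-tuning only has to refine it in the new loss landscape.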

For implementation, the authors took DynaBERT [89] sub-networks as backbones, which offer both half-sized and full-sized models for easy comparison. First, a ternary model of width 0.5× is trained to convergence with the two-stage knowledge distillation. The authors then split it into a binary model of width 1.0× and performed further fine-tuning with prediction-layer distillation.
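The final fine-tuning stage optimizes only the prediction-layer distillation loss. The loop below is a hedged sketch of what that stage might look like: the model and data-loader interfaces, optimizer choice, and hyperparameters are placeholders rather than the paper's settings, quantization-aware details (straight-through gradient estimation, re-binarization after each update) are omitted, and `soft_cross_entropy` refers to the helper sketched after Eq. (5.25).

```python
import torch

def finetune_with_pred_distillation(binary_student, teacher, loader,
                                    epochs=1, lr=2e-5):
    """Fine-tune the split, full-width binary model with prediction-layer
    distillation only.  Both models are assumed to return task logits for a
    dict-style batch; quantization-aware details are omitted.
    """
    optimizer = torch.optim.AdamW(binary_student.parameters(), lr=lr)
    teacher.eval()
    binary_student.train()
    for _ in range(epochs):
        for batch in loader:
            with torch.no_grad():
                teacher_logits = teacher(**batch)
            student_logits = binary_student(**batch)
            # Prediction-layer distillation loss, Eq. (5.25).
            loss = soft_cross_entropy(student_logits, teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return binary_student
```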

Table 5.5 compares the proposed BinaryBERT with a variety of state-of-the-art counterparts for quantizing BERT on MNLI of GLUE [230] and SQuAD [198], including Q-BERT [208], GOBO [279], Quant-Noise [65], and TernaryBERT [285]. Aside from quantization, other general compression approaches are also compared, such as DistilBERT [206], LayerDrop [64], TinyBERT [106], and ALBERT [126]. BinaryBERT has the smallest model size and the best performance among all quantization approaches. Compared with the full-precision model, BinaryBERT retains competitive performance with significantly reduced model size and computation. For example, it achieves a compression ratio of more than 24× over BERT-base (418 MB vs. 17 MB), with only a 0.4% drop on MNLI-m and a 0.0%/0.2% (EM/F1) drop on SQuAD v1.1.

In summary, the paper's contributions are twofold: (1) it is the first work to explore BERT binarization, together with an analysis of the performance drop of binarized BERT models; and (2) it proposes a ternary weight-splitting method that splits a trained ternary BERT to initialize BinaryBERT, followed by fine-tuning for further refinement.